Abstract

Background: Infection of the CNS is considered to be the major cause of encephalitis and more than 100 different 
pathogens have been recognized as causative agents. Despite being identified worldwide as an important public 
health concern, studies on encephalitis are very few and often focus on particular types (with respect to causative 
agents) of encephalitis (e.g. West Nile, Japanese, etc.). Moreover, a number of other infectious and non-infectious 
conditions present with similar symptoms, and distinguishing encephalitis from other disguising conditions
continues to be a challenging task.

Methods: We used canonical correlation analysis (CCA) to assess associations between a set of exposure variables
and a set of symptom and diagnostic variables in human encephalitis. The data consist of 208 confirmed cases of
encephalitis from a prospective multicenter study conducted in the United Kingdom. We used a covariance matrix
based on Gini's measure of similarity and used permutation-based approaches to test the significance of the
canonical variates.

Results: Pair-wise correlations between the risk factor (exposure and demographic) and symptom/laboratory
variables are weak. However, the first canonical variate from CCA revealed a strong multivariate
correlation (ρ = 0.71, se = 0.03, p = 0.013) between the two sets. We found a moderate correlation (ρ = 0.54, se
= 0.02) between the variables in the second canonical variate; however, the value is not statistically significant
(p = 0.68). Our results also show that a very small amount of the variation in the symptom set is explained by
the exposure variables. This indicates that host factors, rather than environmental factors, might be important
for understanding the etiology of encephalitis and for facilitating early diagnosis and treatment of encephalitis
patients.

Conclusions: There is no standard laboratory diagnostic strategy for investigation of encephalitis and even 
experienced physicians are often uncertain about the cause, appropriate therapy and prognosis of encephalitis. 
Exploration of human encephalitis data using advanced multivariate statistical modelling approaches that can 
capture the inherent complexity in the data is, therefore, crucial in understanding the causes of human 
encephalitis. Moreover, application of multivariate exploratory techniques will generate clinically important 
hypotheses and offer useful insight into the number and nature of variables worthy of further consideration in a 
confirmatory statistical analysis. 
Background

Encephalitis is a complex clinical syndrome of the cen- 
tral nervous system (CNS) associated with fatal outcome 
or severe permanent damage including cognitive and 
behavioral impairment and epileptic seizures [1-5]. It is
usually acute, and symptoms may progress rapidly,
causing severe debilitation to patients including other- 
wise healthy children [2,3]. Lewis and Glaser define 
encephalitis as an acute CNS dysfunction with radio- 
graphic or laboratory evidence of brain inflammation 
[2]. There is no standard laboratory diagnostic strategy 
for investigation of encephalitis and even experienced 
physicians often are uncertain about the cause, appro- 
priate therapy and prognosis [1-3,6]. 

Despite being identified worldwide as an important 
public health concern, retrospective studies on encepha- 
litis are very few and studies often focus on particular 
types (often with respect to causative agents) of ence- 
phalitis (West Nile, Japanese, etc.). However, there are 
relatively more studies in the pediatric population 
[2,3,7,8]. Moreover, current knowledge about encephali- 
tis is limited to descriptive statistics. As a result, a com- 
prehensive understanding of human encephalitis, as 
generated through high quality evidence-based studies 
and statistical analyses is limited and much of the cur- 
rent knowledge base lacks generalizability [2,9-11]. 

Encephalitis is characterized by fever, headache and 
altered level of consciousness together with seizures and 
focal neurological findings in some cases [1,3,11]. Using 
data from the same prospective study presented in this 
paper, our group previously identified fever, personality 
and behavioural change, headache and lethargy, as the 
main characteristics of human encephalitis [10,11]. It 
was also shown that diagnostic variables such as abnor- 
mal brain scan and cerebrospinal fluid measurements 
are also indicators of encephalitis. Seizures, focal neuro- 
logical deficits, stiff neck, urinary symptoms, respiratory 
symptoms and gastro-intestinal symptoms have also 
been previously shown to be associated with encephalitis 
[1,2,11]. Fowler et al., in a retrospective study of paediatric
encephalitis, found that fever and encephalopathy were 
the main disease characteristics in a Swedish sample [3]. 

Encephalitis is a rare disease, with annual incidence 
ranging between 3.5-7.4 cases per 100,000 persons 
worldwide [1,2,12]. It affects people of all ages; however, 
the condition is more common in children, the elderly 
and persons with a weakened immune system (e.g. HIV/ 
AIDS patients and patients undergoing cancer treat- 
ment). Encephalitis is known to affect both sexes; how- 
ever, most studies have indicated a slightly higher 
incidence rate in males [1,13-15]. The epidemiology of
encephalitis is difficult to summarize since few
population-based studies exist, many causal pathogens are
capable of inducing encephalitis-like symptoms and
most cases go unreported to health authorities.
Consequently, many details about its epidemiology have yet to
be explained [1,2,10].

To date, infection of the CNS is considered to be the 
major cause of encephalitis and more than 100 different 
pathogens have been recognized as causative agents 
[1,10]. However, an estimated 32-85% of cases have 
unknown disease etiology [1,16-20]. For instance, about 
85% of the 189 cases in a study conducted in Minnesota, 
USA are of unknown cause [20]. In a California based 
study, about 65% of the 334 cases are of unknown etiol- 
ogy [18]. In a study conducted in the UK, about 60% of 
700 cases are of unknown etiology [16]. Among the 
known causes, Herpes Simplex Virus (HSV) has been 
recognized as the most common etiology [1,10,20]. 
Viruses, bacteria, fungi as well as parasites can cause 
encephalitis [1-3]. Rarely, encephalitis can also be trig- 
gered by brain injury, brain tumor, drug reactions and 
lead poisoning. The main infectious causes of encephali- 
tis are listed in a review paper by Granerod and Crow- 
croft [1]. 

In many parts of the world, viral infections of the cen- 
tral nervous system are often spread via vector-borne 
infection, such as mosquito bites and tick bites; how- 
ever, animal-to-human interactions also can facilitate 
disease spread (e.g. raccoon feces, cat scratches, animal 
bites) and human-to-human transmission is also possi- 
ble. Bacteria causing encephalitis can also spread 
through animal contact and water exposure. Possible 
risk factors associated with encephalitis and disease 
pathologies are provided in Lewis and Glaser [2]. 

A number of other infectious and non-infectious con- 
ditions present with similar symptoms and hence a chal- 
lenge lies in distinguishing encephalitis from other 
disguising conditions [1,2,6]. Exploration of human 
encephalitis data using advanced multivariable statistical 
modelling approaches that can capture the inherent 
complexity in the data is, therefore, crucial for elucidat- 
ing the causes of human encephalitis. Moreover, appli- 
cation of multivariate exploratory techniques will 
generate clinically important and better focused hypoth- 
eses that would benefit encephalitis researchers in redu- 
cing the number of variables to be considered for 
further confirmatory statistical analysis. This will ulti- 
mately lead towards better evidence-based clinical prac- 
tices, including: diagnosis, prognosis discovery and 
development of novel therapeutic options. 

In this paper, we use canonical correlation analysis 
(CCA) to explore the relationship between a set of 
exposure variables that are potential risk factors and a 
set of symptom and diagnostic variables in encephalitis. 
The symptom and diagnostic variables considered in 
this paper include variables that were previously identified
as main indicators of encephalitis as well as those with a 
potential to be associated with the disease. Our data 
consist mostly of binary variables (presence or absence 
of a particular attribute) and as a result, the usual corre- 
lation matrix which is particularly designed for continu- 
ous measurements is not appropriate. We therefore 
propose to use a correlation matrix based on Gini's idea 
of variance or likeability for categorical variables. 

Methods 

Study population and data description 

The data comprise 268 patients recruited from 24
hospitals/neurological centers in three geographical locations
across England (South West, London, North West). 
Measurements from 16 symptom, 6 diagnostic (3 from 
cerebrospinal fluid, 2 from brain scans/images and 1 
electroencephalography) and 13 exposure variables were 
recorded. Age, gender, duration of illness and length of 
hospital stay were also available. Most of the variables in 
the study are binary indicating presence or absence of 
attributes; others were dichotomized before performing
the CCA. Age is dichotomized into
young children (age < 10 years) and
older children and adults (age > 10 years).
Duration of illness is dichotomized as short (< 100 days)
and long (> 100 days), and length of hospital
stay is dichotomized as short (< 50 days) and long (> 50 days).
These cutoff values are determined using results from 
analysis of univariate distributions. Variables included in 
our study are listed in Table 1. More details about the 
UK encephalitis study can be found in the original 
paper [10]. 

Methods 

We used canonical correlation analysis (CCA) to investigate
the relationship between the set of exposure and
demographic variables (X) and the set of symptom,
clinical and diagnostic variables (Y) in human encephalitis.

Table 1 List of the two sets of variables: one set
consisting of 13 exposure and 2 demographic variables,
and a second set consisting of 18 symptom/clinical and
6 diagnostic variables

Exposure and Demographic Variables    Symptom/Clinical and Diagnostic Variables
Animal contact                        Lethargy
Tick bite                             Personality/behavioral changes
Mosquito bite                         Seizure
Insect bite                           Stiff neck
Immunization                          Headache
Recent infection                      Irritability
Travel abroad                         Fever
Travel within UK                      Focal neurological findings
Raw fish                              Coma
Untreated water                       Neurological signs
Head trauma                           Gastrointestinal symptoms
Sick person contact                   Respiratory symptoms
Water exposure                        Confusion
Age                                   Photophobia
Gender                                Rash
                                      Urinary symptoms
                                      Duration of illness
                                      Length of hospital stay
                                      Abnormal white blood cell count (WCC)
                                      Abnormal magnetic resonance imaging (MRI)
                                      Abnormal computed tomography (CT)
                                      Abnormal electroencephalography (EEG)
                                      Abnormal glucose
                                      Abnormal protein

Canonical Correlation Analysis (CCA)

Consider two sets of variables $X_p = \{x_1, x_2, \ldots, x_p\}$ and
$Y_q = \{y_1, y_2, \ldots, y_q\}$ measured on n individuals, where
p and q represent the number of variables in each set.
Canonical correlation analysis seeks to determine the
optimal set of min(p, q) linear combinations (called
canonical variates), $a'x = \sum_i a_i x_i$ and $b'y = \sum_j b_j y_j$, from
sets $X_p$ and $Y_q$ which produce maximum correlation
[21-25]. That is, the method finds two vectors
$a = (a_1, a_2, \ldots, a_p)$ and $b = (b_1, b_2, \ldots, b_q)$ such that the
following correlation is maximized:

$$\rho = \mathrm{Corr}(a'x,\, b'y) = \frac{a' S_{xy}\, b}{\sqrt{(a' S_{xx}\, a)\,(b' S_{yy}\, b)}} \qquad (1)$$

where $S_{xx}$ and $S_{yy}$ are the within-set covariance
matrices for X and Y, respectively, and $S_{xy}$ is the
between-set covariance matrix (with $S_{yx} = S_{xy}'$). The solution is obtained
by solving the following two eigenvalue problems [23,24]:

$$(S_{xx}^{-1} S_{xy} S_{yy}^{-1} S_{yx} - \lambda I)\, a = 0,$$

$$(S_{yy}^{-1} S_{yx} S_{xx}^{-1} S_{xy} - \lambda I)\, b = 0,$$

where the eigenvalues $\lambda$, which sometimes are
denoted by $r^2$, represent the squared canonical
correlations. The eigenvectors (a, b) corresponding to
the leading eigenvalue are the solutions to equation (1). The
first canonical variate is therefore the one which
explains most of the relationship. CCA has been
successfully applied in medical and epidemiological
research [26,27].
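The eigenvalue formulation of CCA described above can be illustrated with a minimal Python sketch. It is an illustration only, not the implementation used in this study (which relied on R libraries and a Gini-based covariance matrix): the function name `canonical_correlations`, the simulated continuous data and the use of ordinary sample covariances are all assumptions made for the example.

```python
import numpy as np

def canonical_correlations(S_xx, S_yy, S_xy):
    """Return canonical correlations and weight vectors from covariance blocks."""
    S_yx = S_xy.T
    # S_xx^{-1} S_xy S_yy^{-1} S_yx: its eigenvalues are the squared canonical
    # correlations and its eigenvectors are the X-side weight vectors a.
    M = np.linalg.solve(S_xx, S_xy) @ np.linalg.solve(S_yy, S_yx)
    eigvals, eigvecs = np.linalg.eig(M)
    eigvals, eigvecs = eigvals.real, eigvecs.real
    k = min(S_xx.shape[0], S_yy.shape[0])
    order = np.argsort(eigvals)[::-1][:k]   # largest eigenvalues first
    rho = np.sqrt(np.clip(eigvals[order], 0.0, 1.0))
    A = eigvecs[:, order]
    # The Y-side weights b are proportional to S_yy^{-1} S_yx a.
    B = np.linalg.solve(S_yy, S_yx @ A)
    return rho, A, B

# Simulated continuous data (hypothetical, not the encephalitis data).
rng = np.random.default_rng(0)
n, p, q = 500, 3, 4
X = rng.normal(size=(n, p))
Y = rng.normal(size=(n, q))
Y[:, 0] += X[:, 0]  # induce an association between the two sets

S = np.cov(np.hstack([X, Y]), rowvar=False)
S_xx, S_yy, S_xy = S[:p, :p], S[p:, p:], S[:p, p:]
rho, A, B = canonical_correlations(S_xx, S_yy, S_xy)

# The correlation between the first pair of canonical variates
# should equal the first canonical correlation.
u, v = X @ A[:, 0], Y @ B[:, 0]
first_pair_corr = np.corrcoef(u, v)[0, 1]
print(rho, first_pair_corr)
```

Solving only the X-side eigenproblem and deriving b from a avoids having to match eigenvectors from two separate decompositions; both problems share the same non-zero eigenvalues.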

Covariance/Correlation matrix for categorical data 

Since data in this study consist mostly of binary vari- 
ables (presence or absence of a particular attribute), the 
usual correlation matrix, which is particularly designed 
for continuous measurements, would not be an
appropriate choice. Covariance or correlation matrices for
categorical data have been previously considered by
many and several formulations have been proposed to
assess the strength of association between two
categorical variables. Here we use the covariance/correlation
matrix proposed by Okada et al. [29,30]. Their approach
is a generalization of Gini's definition of variance or
likeability for categorical data, which is also known as
Gini's index [28-33].

Let $X = \{x_1, x_2, \ldots, x_p\}$, where the $x_i$ are categorical
variables measured on n individuals. The ij-th element of the
variance-covariance matrix V (the covariance between $x_i$
and $x_j$ when $i \neq j$, and the variance of $x_i$ when $i = j$) is
calculated as

$$v_{ij} = \max_{L}\, Q_{ij}(L),$$

where $Q_{ij}(L)$ is defined as in Okada et al. [30] and L is an
orthogonal matrix (orthogonal transformation) [30]; in our case
L = I, the identity. When calculating
variance, for instance, $x_{ia} - x_{ib} = 1$ if $x_{ia} \neq x_{ib}$ and
$x_{ia} - x_{ib} = 0$ if $x_{ia} = x_{ib}$. The ij-th element of the correlation
matrix R can then be calculated as

$$r_{ij} = \frac{v_{ij}}{\sqrt{v_{ii}\, v_{jj}}}.$$

Simplified formulas for two special cases (binary and 
trinomial variables), using 2x2 and 3x3 contingency 
tables, can be found in Okada et al. [30,33]. We imple- 
mented the above variance-covariance/correlation for- 
mula in the R statistical software and used it in our 
CCA analysis. Pairwise available data were used when 
missing values occur. 
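The pairwise-difference rule quoted above (a difference of 1 when two individuals fall in different categories and 0 otherwise) can be illustrated for the variance term with a small Python sketch. The normalization by 2n² is the usual form of Gini's variance for categorical data and is an assumption here; the covariance term $Q_{ij}(L)$ of Okada et al. is not reproduced. The function names are hypothetical, and the example counts are the fever counts from Table 2 (162 present, 46 absent).

```python
from collections import Counter

def gini_variance_pairwise(values):
    """Gini variance: share of ordered pairs of individuals in different categories, halved."""
    n = len(values)
    mismatches = sum(1 for a in values for b in values if a != b)
    return mismatches / (2 * n * n)

def gini_variance_closed_form(values):
    """Equivalent closed form: (1 - sum of squared category proportions) / 2."""
    n = len(values)
    return 0.5 * (1.0 - sum((c / n) ** 2 for c in Counter(values).values()))

# Fever in the 208 confirmed cases: 162 present, 46 absent (Table 2).
fever = [1] * 162 + [0] * 46
v_pairwise = gini_variance_pairwise(fever)
v_closed = gini_variance_closed_form(fever)
print(v_pairwise, v_closed)
```

For a binary variable with proportion p of ones, both forms reduce to p(1 - p), the familiar Bernoulli variance.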

Statistical analysis is performed using the Canonical 
Correlation Analysis (CCA) and Significance Tests for 
Canonical Correlation Analysis (CCP) libraries in the R 
software package [34-36]. Parametric multivariate tests 
are not appropriate since our data consists of binary 
variables and hence violates the multivariate normality 
assumption. We, therefore, used a non-parametric per- 
mutation approach and calculated standard errors and 
p-values based on 10,000 permutations. 
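The permutation approach can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of the R libraries used in the study: it uses ordinary (Pearson) covariances rather than the Gini-based matrix, tests only the first canonical correlation, and the function names and the simulated binary data are hypothetical.

```python
import numpy as np

def first_canonical_correlation(X, Y):
    """Largest canonical correlation via QR and SVD (assumes full column rank)."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return float(min(s[0], 1.0))

def permutation_p_value(X, Y, n_perm=1000, seed=0):
    """Permute the rows of Y to break the X-Y link and compare with the observed value."""
    rng = np.random.default_rng(seed)
    observed = first_canonical_correlation(X, Y)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(Y))
        if first_canonical_correlation(X, Y[perm]) >= observed:
            exceed += 1
    # The +1 terms count the observed arrangement as one of the permutations.
    return observed, (exceed + 1) / (n_perm + 1)

# Simulated binary data (hypothetical): 208 rows, one Y column tied to one X column.
rng = np.random.default_rng(1)
n = 208
X = rng.integers(0, 2, size=(n, 3)).astype(float)
Y = rng.integers(0, 2, size=(n, 4)).astype(float)
flip = rng.random(n) < 0.1
Y[:, 0] = np.where(flip, 1.0 - X[:, 0], X[:, 0])

observed, p_value = permutation_p_value(X, Y, n_perm=500)
print(observed, p_value)
```

Permuting whole rows of Y preserves the within-set correlation structure of both sets while destroying any association between them, which is the null hypothesis being tested.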

Results and Discussion 

Our data set consists of 268 patients (152 from North
West England, 94 from London and 22 from South
West), of whom 263 met the case definition (the case
definition criteria are presented in the original paper
our group recently published [10]). Of these, 208 patients
are confirmed encephalitis cases (40 of the 208 cases are
meningoencephalitis patients). We focused on these 208
confirmed encephalitis patients for the CCA in
this paper; however, for comparison purposes, we
also performed the analysis on the 263 patients for
whom the case definition was met. Summary statistics
for our data on encephalitis patients are presented in
Table 2.

The results in Table 2 show that slightly more of the
encephalitis patients are men (54%, n = 113) than women
(46%, n = 95). This is in agreement with previous
findings of a slightly higher incidence in males [13-15].
Most of the encephalitis patients are children
and young adults (median age = 30, IQR = 45),
and a large proportion of the patients are children of
age < 10 (26%, n = 55), suggesting that young children
are at higher risk of developing encephalitis. The age
distribution is quite uniform after age 10, where
approximately equal proportions of patients (9.6%, n = 20) are
observed in 10-year age intervals. We, therefore, used
10 as a cutoff point when dichotomizing age for the
CCA.

Our results show that the majority of encephalitis
patients (69.7%, n = 145) had been hospitalized for < 50
days (median = 27, IQR = 43) and that duration of illness is
less than 100 days (median = 37, IQR = 46.25) for a large
proportion (80%, n = 167) of the patients. Consequently,
we used 50 days and 100 days as cutoffs when
dichotomizing hospital stay and duration of illness,
respectively, for the CCA.

Overall, the data on the encephalitis patients are sparse:
a larger proportion of zeroes (absence) than
ones (presence) is observed for most of the variables
(Figures 1 and 2). This is particularly the case for the
exposure variables (Figure 1), with the exception of
animal contact (48.6% exposed), recent infection (37.5% of
the patients have had a recent infection) and sick person
contact (26%). For instance, the percentages of patients
exposed to tick and mosquito bites are only 3.4% (n =
7) and 6.3% (n = 13), respectively. A considerable
percentage of patients had water exposure (18.3%) or had
experienced head trauma (11.1%).

On the other hand, symptom and diagnostic variables
have relatively larger event rates (Figure 2); the
variables with the smallest rates are coma and photophobia,
which were observed in only 3.8% (n = 8) and 7.7% (n
= 16) of the patients, respectively. Fever and abnormal
white blood cell count (abnormal WCC) are indicated as
the two main characteristics of encephalitis: 77.9%
and 76.9% of the patients had fever and abnormal
WCC, respectively (Figure 2, Table 2). The results also
show that personality and behavioral change, headache,
lethargy and abnormal protein are the next most
frequently occurring characteristics of encephalitis. Some
missingness is observed in the exposure variables
(Figure 1); however, a substantial amount of missing data
is observed in diagnostic variables, where
measurements from EEG and glucose were missing for 42.3% (n
= 88) and 37.5% (n = 78) of the patients, respectively
(Table 2, Figure 2). Consequently, abnormal EEG,
although previously shown to be one of the main
indicators of encephalitis, is observed in only about half of the
patients (48.1%). Nevertheless, among patients with
available EEG measurements (n = 120), 83.3% (n = 100)
have an abnormal EEG, which is in agreement
with previous findings. This is mainly because the
diagnostic decision tree often leads clinicians to carry out an
EEG in patients with a high likelihood of it being
abnormal; seizures are one such trigger, for example.
Patients with EEGs are therefore a particular clinical cluster of
their own.

Table 2 Descriptive statistics for data on the 208 confirmed encephalitis patients

Variables                           Present        Absent         Missing

Exposure and Demographic Variables
Sex (male)                          113 (54.3%)    95 (45.7%)     0 (0%)
Age (< 10)                          55 (26.4%)     146 (70.2%)    7 (3.4%)
Animal Contact                      101 (48.6%)    95 (45.7%)     12 (5.8%)
Tick Bite                           7 (3.4%)       188 (90.4%)    13 (6.3%)
Mosquito Bite                       13 (6.3%)      182 (87.5%)    13 (6.3%)
Insect Bite                         14 (6.7%)      179 (86.1%)    15 (7.2%)
Immunization                        14 (6.7%)      182 (87.5%)    12 (5.8%)
Recent Infection                    78 (37.5%)     113 (54.3%)    17 (8.2%)
Travel Abroad                       27 (13%)       174 (83.7%)    7 (3.4%)
Travel UK                           31 (14.9%)     164 (78.8%)    13 (6.3%)
Raw Fish                            7 (3.4%)       184 (88.5%)    17 (8.2%)
Untreated Water                     6 (2.9%)       185 (88.9%)    17 (8.2%)
Water Exposure                      38 (18.3%)     157 (75.5%)    13 (6.3%)
Head Trauma                         23 (11.1%)     173 (83.2%)    12 (5.8%)
Sick Person Contact                 54 (26%)       136 (65.4%)    18 (8.7%)

Symptom and Diagnostic Variables
Abnormal CT                         51 (24.5%)     123 (59.1%)    34 (16.3%)
Abnormal MRI                        102 (49%)      69 (33.2%)     37 (17.8%)
Abnormal EEG                        100 (48.1%)    20 (9.6%)      88 (42.3%)
Abnormal Glucose                    46 (22.1%)     84 (40.4%)     78 (37.5%)
Abnormal Protein                    124 (59.6%)    71 (34.1%)     13 (6.3%)
Abnormal WCC                        160 (76.9%)    42 (20.2%)     6 (2.9%)
Lethargy                            116 (55.8%)    92 (44.2%)     0 (0%)
Irritability                        77 (37%)       131 (63%)      0 (0%)
PB Change                           133 (63.9%)    75 (36.1%)     0 (0%)
Seizure                             105 (50.5%)    103 (49.5%)    0 (0%)
Stiff Neck                          46 (22.1%)     162 (77.9%)    0 (0%)
Headache                            125 (60.1%)    83 (39.9%)     0 (0%)
Fever                               162 (77.9%)    46 (22.1%)     0 (0%)
Focal-Neurological                  76 (36.5%)     132 (63.5%)    0 (0%)
Coma                                8 (3.8%)       200 (96.2%)    0 (0%)
Neurological                        63 (30.3%)     145 (69.7%)    0 (0%)
GI Symptoms                         103 (49.5%)    105 (50.5%)    0 (0%)
Respiratory                         42 (20.2%)     166 (79.8%)    0 (0%)
Confusion                           74 (35.6%)     134 (64.4%)    0 (0%)
Rash                                25 (12%)       183 (88%)      0 (0%)
Photophobia                         16 (7.7%)      192 (92.3%)    0 (0%)
Urinary                             21 (10.1%)     187 (89.9%)    0 (0%)
Hospital Stay (< 50 days)           145 (69.7%)    60 (28.8%)     3 (1.4%)
Duration of illness (< 100 days)    167 (80.3%)    31 (14.9%)     10 (4.8%)

Figure 1 Bar plot showing the distribution of exposure
variables for the 208 encephalitis patients (horizontal axis: percent). Dark grey, light grey
and white represent percentage of patients who are exposed, not
exposed and missing exposure status, respectively.

Figure 2 Bar plot showing the distribution of symptom and
laboratory measurements for the 208 encephalitis patients (horizontal axis: percent).
Dark grey, light grey and white represent percentage of patients for
whom the symptoms are present, absent and symptom status
missing, respectively.

Heatmaps of within- and between-set correlations are
presented in Figure 3, where dark blue and dark red
colors indicate very strong correlations (a color indicator
bar with ranges of correlations is presented under the
heatmaps). Figure 3 indicates that, in general, weak to moderate
(-0.22 to 0.63) pair-wise correlations exist both within and
between the X and Y sets of variables;
the largest correlations are observed between length of
hospital stay and duration of illness (0.63), and between
tick and insect bites (0.55).

CCA produced min(p, q) = 15 canonical variates, where p =
15 is the number of variables in the X set and q = 24 is
the number of variables in the Y set. However, only the
first canonical variate is statistically significant at the α =
0.05 level. We will, therefore, discuss only the first
canonical variate in this paper.

The cross-correlation matrix displayed in Figure 3
shows that only weak pair-wise correlations exist between the
risk factor (exposure and demographic) and outcome
(symptom, clinical and diagnostic) variables. However,
the first canonical solution/variate from CCA revealed a
strong multivariate correlation (ρ = 0.71, standard error
(se) = 0.03, p-value = 0.013) between the two sets. We
found a moderate correlation (ρ = 0.54, se = 0.02)
between the variables in the second canonical variate;
however, the value is not statistically significant (p-value
= 0.68).

The first canonical solution consists of two sets of 
variables: the linear combination of X set variables 
(exposure and demographic features) and the linear 
combination of the Y set variables (symptom, clinical 
and diagnostic features). Individual canonical loadings 
(structural coefficients) between these two sets of vari- 
ables with their corresponding canonical variates are 
presented in Table 3. 

The top ranked variable in the exposure set is age
(loading = 0.94), indicating that age accounts for a large
share (88%) of the variation in the first canonical variate
of the exposure set and is hence the driving variable for that
canonical variate. The cross loading for age also shows
that a considerable amount (45%) of the variation in the
canonical variate of symptoms is explained by age. This
result is in agreement with previous findings that
showed that children are at an increased risk of
developing encephalitis compared to adults. Sick person contact
and immunization also contributed considerably towards
the first canonical variate, with loadings of 0.47 and 0.27
and cross loadings of 0.34 and 0.22, respectively. The
contribution of the rest of the exposure variables
towards the variation in the first canonical variate is
negligible. Variables that contributed the least include
animal contact and sex, where only 0.25% of the variation
in the first canonical variate was attributed to these
variables. Variables that contribute to the first canonical
variates of both sets are provided in a simple "finger
plot" presented in Figure 4.
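The percentages quoted above follow from squaring the loadings. The short Python sketch below reproduces that arithmetic for the three exposure variables named in the text, using the reported first canonical correlation of 0.71. It assumes the standard CCA identity that a cross loading equals the loading multiplied by the canonical correlation; the reported cross loadings (0.34 and 0.22) may differ slightly from this product because of rounding and the use of pairwise available data. The helper name `percent_variance` is hypothetical.

```python
rho_1 = 0.71  # first canonical correlation reported in the text

# Loadings of the top ranked exposure variables, as quoted in the text.
loadings = {"age": 0.94, "sick person contact": 0.47, "immunization": 0.27}

def percent_variance(coefficient):
    """Squared loading (or cross loading) expressed as a percentage."""
    return 100.0 * coefficient ** 2

# Share of the exposure variate's variation linked to each variable.
own_share = {name: percent_variance(value) for name, value in loadings.items()}

# Cross loading = loading x canonical correlation (standard CCA identity).
cross_loadings = {name: value * rho_1 for name, value in loadings.items()}
cross_share = {name: percent_variance(value) for name, value in cross_loadings.items()}

for name in loadings:
    print(f"{name}: own {own_share[name]:.1f}%, "
          f"cross loading {cross_loadings[name]:.2f}, cross {cross_share[name]:.1f}%")
```

For age this gives about 88% for the exposure variate and about 45% for the symptom variate, matching the figures quoted in the paragraph.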

Among the symptom and diagnostic variables,
abnormal WCC, headache and confusion are the three top
ranked variables, contributing 27%, 26% and 25% of the
variation in the first canonical variate of the symptom
set, respectively. The other variables with a
considerable contribution towards the first canonical variate are
abnormal protein, PB change, length of hospital stay and
duration of illness, explaining 15%, 12%, 9% and 9% of
the variation, respectively. The canonical cross loadings
also indicate that the symptom variables shown in Figure
4 explain a considerable amount of the variation in the
first canonical variate of the exposure set.

Figure 3 Correlations within and between two sets of variables where X represents exposure and demographic variables, and Y
consists of symptom, clinical and diagnostic variables. Panels: X correlation, Y correlation and cross-correlation; the color scale ranges from -1.0 to 1.0.

Fever, although present in the majority of the patients 
(77.9%, Table 2), does not contribute much towards the 
first canonical variates, explaining only 0.04% and 0.16% 
of the variation in the symptom and exposure variates, 
respectively. 

We also performed CCA on the 263 patients who
met the case definition criteria as presented in the
original paper [10]. In general, the pattern observed in
the within- and between-set correlations for this data set is
similar to that obtained for the 208 confirmed
encephalitis cases, with weak to moderate correlations
between the variables. A correlation of ρ = 0.68
(p-value = 0.007) was obtained between the sets of
variables in the first canonical solution. The second
canonical solution resulted in ρ = 0.54 (p-value = 0.19).
Overall, the canonical loadings for X and their
rankings are similar to those presented in Table 3 and
Figure 4, respectively. Therefore, our analysis based on
263 patients identified the same sets of exposure
variables to be strongly associated with symptom, clinical
and diagnostic variables.

Redundancy coefficients indicate that a very small
amount of the variation in the original symptom
variables is explained by the exposure canonical variates.
Only 6% of the variation in the symptom variables is
explained by the first exposure canonical variate, 5% by
the second canonical variate and 4% by the third. This
indicates that the variation in the symptoms might be
caused by host factors rather than environmental and
exposure factors. The idea that characteristics of the
host may be more important than the pathogen is
consistent with the observation that for some causes, such
as herpes simplex virus (HSV), encephalitis is a rare
outcome of a common infection. Another possible
hypothesis that might be drawn from our results is
that exposure and symptom variables
provide independent information towards understanding
the etiology of encephalitis. Further case-control
analysis based on exposure, symptom and host factors
might shed light on the factors that
could help facilitate diagnosis and treatment of
encephalitis patients.
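To see why a canonical correlation of 0.71 can coexist with single-digit redundancy, recall the usual (Stewart-Love) redundancy index: the mean squared loading of a set on its own canonical variate, multiplied by the squared canonical correlation. The Python sketch below illustrates this under stated assumptions; the function name and the four loadings are hypothetical and are not values from Table 3.

```python
def redundancy_index(own_loadings, canonical_corr):
    """Mean squared loading on the set's own variate times the squared canonical correlation."""
    mean_squared_loading = sum(l * l for l in own_loadings) / len(own_loadings)
    return mean_squared_loading * canonical_corr ** 2

# Hypothetical loadings for a small symptom set (illustration only).
example_loadings = [0.5, -0.5, 0.3, 0.1]
rd = redundancy_index(example_loadings, 0.71)
print(f"redundancy = {100 * rd:.1f}%")
```

With modest loadings, the variate summarizes only a small share of its own set, so even a strong correlation between variates translates into a small share of the original symptom variation being explained.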
Table 3 Canonical loadings of individual variables in their respective canonical variates for the first canonical solution
of the CCA

Canonical Loadings (Structural Coefficients)

Exposure and Demographic Variables   Loading    Symptom and Diagnostic Variables     Loading
Sex (male)                            0.03      Abnormal CT                           0.07
Age (< 10)                            0.94      Abnormal MRI                          0.0?
Animal Contact                       -0.04      Abnormal EEG                          0.22
Tick Bite                             0.0?      Abnormal Glucose                     -0.19
Mosquito Bite                         0.11      Abnormal Protein                     -0.39
Insect Bite                          -0.08      Abnormal WCC                         -0.52
Immunization                          0.27      Lethargy                              0.27
Recent Infection                     -0.10      Irritability                          0.2?
Travel Abroad                        -0.0?      PB Change                             0.3?
Travel UK                            -0.13      Seizure                               0.24
Raw Fish                             -0.05      Stiff Neck                           -0.12
Untreated Water                      -0.05      Headache                             -0.51
Water Exposure                        0.18      Fever                                -0.03
Head Trauma                           0.13      Focal-Neurological                   -0.06
Sick Person Contact                   0.47      Coma                                  0.001
                                                Neurological                         -0.24
                                                GI Symptoms                           0.08
                                                Respiratory                           0.13
                                                Confusion                            -0.50
                                                Rash                                  0.11
                                                Photophobia                          -0.18
                                                Urinary                              -0.11
                                                Hospital Stay (< 50 days)             0.30
                                                Duration of illness (< 100 days)      0.30

A correlation of ρ = 0.71 (p-value = 0.01) is obtained for the first canonical correlation.
? = digit illegible in the source.



Conclusion 

We performed exploratory multivariate analysis using
CCA to study associations between two sets of variables
in encephalitis patients. One set consists of exposure
and demographic variables, including variables that were
previously identified in the literature as potential risk
factors. The second set includes symptom, clinical and
diagnostic variables, some of which have
been shown to be important clinical characteristics of
encephalitis. Although pair-wise cross correlations
between the two sets of variables are weak to moderate,
CCA revealed a strong multivariate correlation between
the two sets.

Figure 4 The top ranked variables in the first canonical
solution and absolute value of their canonical loadings. A
multivariate correlation of ρ = 0.71 (p-value = 0.01, indicated in the
figure by *) is obtained between the two sets of variates.
Risk factors shown: age (0.9), sick person contact (0.5), immunization (0.3), water exposure (0.2).
Symptoms shown: abnormal WCC (0.5), headache, confusion (0.5), abnormal protein (0.4), PB change (0.4), length of stay (0.3), duration of illness (0.3).

Our analysis identified a set of 4 exposure/demographic
variables (age, sick person contact,
immunization and water exposure) that is strongly associated
with a set of 7 symptom/diagnostic variables (abnormal WCC,
headache, confusion, abnormal protein, personality and
behavioral change, length of stay and duration of illness).

Our analysis also revealed that a very small amount of
the variation in the symptom set is explained by the
exposure variables. This indicates that host factors,
rather than environmental factors, might be important
for understanding the etiology of encephalitis and
for facilitating early diagnosis and treatment of encephalitis
patients.

CCA is exploratory in nature and measures
association rather than causation. However, our analysis
identified exposure variables that might be strongly
associated with encephalitis and generated important
hypotheses that can be investigated further to
identify risk factors that are predictive of encephalitis. A
confirmatory case-control analysis involving
encephalitis and non-encephalitis patients is needed to
identify risk factors and important symptom variables
that can be used to facilitate diagnosis. CCA results
may, however, provide insight into potentially smaller
sets of variables worth investigating further.
Furthermore, it is important to highlight that exposure
variables such as tick bite do not occur frequently in the
UK and also do not often lead to encephalitis, and so
are difficult to study using conventional methods such
as logistic regression analysis. CCA can, therefore, be
a useful tool in identifying risk factors associated
with human encephalitis and other rare and complex
diseases where regression approaches may not be
optimal.